High-performance LLM inference with vLLM

What is vLLM?

vLLM is an easy-to-use library for LLM inference and serving. It supports a wide variety of models and uses optimized kernels to make efficient use of GPUs.

Why vLLM?

We benchmarked vLLM and llama-cpp on Torch and found that vLLM performs better for Qwen2.5-0.5B-Instruct and Qwen2.5-7B-Instruct with 256 input and 256 output tokens. Based on our results, vLLM retains a performance advantage over llama-cpp across both model sizes, although the relative gap narrows as the model size increases. In addition, vLLM integrates more naturally with the Hugging Face/Torch ecosystem in an HPC environment, whereas llama-cpp requires more manual setup and compatibility adjustments.

Model    Inference Server    Generated Throughput (tok/s)    Median Latency (ms)
0.5B     vLLM                ~2273                           ~890
0.5B     llama-cpp           ~1312                           ~1440
7B       vLLM                ~354                            ~5780
7B       llama-cpp           ~277                            ~7200

Test Environment

GPU: NVIDIA L40S

Precision: FP16

Workload: 256 input / 256 output tokens

Concurrency: 8

Max requests: 64

vLLM Installation Instructions

Create a vLLM directory in your /scratch directory, then pull the vLLM container image into it (this produces a vllm-openai_latest.sif file, used in the commands below):

mkdir -p /scratch/$USER/vllm && cd /scratch/$USER/vllm
apptainer pull docker://vllm/vllm-openai:latest

Avoid filling up your $HOME directory

To avoid exceeding your $HOME quota (50GB) and inode limits (30,000 files), you should redirect vLLM's cache and Hugging Face's model downloads to your scratch space:

export HF_HOME=/scratch/$USER/hf_cache
export VLLM_CACHE_ROOT=/scratch/$USER/vllm_cache

To make these settings persistent across login sessions, append them to your ~/.bashrc:

echo "export HF_HOME=/scratch/\$USER/hf_cache" >> ~/.bashrc
echo "export VLLM_CACHE_ROOT=/scratch/\$USER/vllm_cache" >> ~/.bashrc

Note

Files on $SCRATCH are not backed up and will be deleted after 60 days of inactivity. Always keep your source code and .slurm scripts in $HOME!

Run vLLM

Online Serving (OpenAI-Compatible API)

vLLM implements the OpenAI API protocol, so it can serve as a drop-in replacement for applications that use OpenAI's services. By default, the server starts at http://localhost:8000; you can change the address with the --host and --port arguments.

In Terminal 1, start the vLLM server (this example uses a Qwen model):

apptainer exec --nv vllm-openai_latest.sif vllm serve "Qwen/Qwen2.5-0.5B-Instruct"

Wait until the server log shows:

Application startup complete.

Then open another terminal and log in to the same compute node as in Terminal 1.

In Terminal 2, send a chat completion request to the server:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Your prompt..."}
    ]
  }'
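
You can also query the server programmatically. The snippet below is a minimal sketch using the openai Python client (this assumes the openai package is installed in your environment; the api_key value is only a placeholder, since the server in this example does not require a real key):

from openai import OpenAI

# Point the client at the local vLLM server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Your prompt..."}],
)
print(response.choices[0].message.content)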

Offline Inference

If you need to process a large dataset at once without setting up a server, you can use vLLM's LLM class. For example, the following code downloads the facebook/opt-125m model from Hugging Face and initializes it in vLLM with the default configuration.

from vllm import LLM

# Initialize the vLLM engine.
llm = LLM(model="facebook/opt-125m")

After initializing the LLM instance, use the available APIs to perform model inference.
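
For example, here is a minimal sketch of batch generation with the generate API (the prompts and sampling settings are illustrative):

from vllm import LLM, SamplingParams

# Initialize the engine and define sampling settings.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# generate() takes a list of prompts and returns one result per prompt.
outputs = llm.generate(["Hello, my name is", "The capital of France is"], sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)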

SGLang: A Simple Option for Offline Batch Inference

For cases where users only want to run batch inference and do not need an HTTP endpoint, SGLang provides a much simpler offline engine API compared to running a full vLLM server. It is particularly suitable for dataset processing, evaluation pipelines, and one-off large-scale inference jobs. For more details and examples, see the official SGLang offline engine documentation here: https://docs.sglang.io/basic_usage/offline_engine_api.html
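
As a rough sketch (based on the linked documentation; exact argument names may differ between SGLang versions), the offline engine can be used like this:

import sglang as sgl

# Load the model into SGLang's offline engine (no HTTP server involved).
llm = sgl.Engine(model_path="Qwen/Qwen2.5-0.5B-Instruct")

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = {"temperature": 0.8, "max_new_tokens": 64}

# generate() returns one result per prompt containing the generated text.
outputs = llm.generate(prompts, sampling_params)
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])

llm.shutdown()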

vLLM CLI

The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:

vllm --help

Serve - Start the vLLM OpenAI-compatible API server.

vllm serve meta-llama/Llama-2-7b-hf

Chat - Generate chat completions via the running API server.

# Directly connect to localhost API without arguments
vllm chat

# Specify API url
vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1

# Quick chat with a single prompt
vllm chat --quick "hi"

Complete - Generate text completions based on the given prompt via the running API server.

# Directly connect to localhost API without arguments
vllm complete

# Specify API url
vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1

# Quick complete with a single prompt
vllm complete --quick "The future of AI is"

For more CLI command references, see https://docs.vllm.ai/en/stable/cli/.